compute loss only if training and update token metric naming by ved1beta · Pull Request #3293 · axolotl-ai-cloud/axolotl

ved1beta · 2025-12-02T05:32:33Z

fixes #3291

Summary by CodeRabbit

Bug Fixes
- Fixed token-per-second tracking to only occur when the model is actively training. Previously, tracking could run regardless of training state.

coderabbitai · 2025-12-02T05:32:49Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The compute_loss method in the base trainer now conditionally enables token-per-second tracking based on model training state. Token-per-second metrics are now computed only when model.training is True, instead of whenever the include_tkps flag was set.
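A minimal sketch of the guard described in this walkthrough, using plain Python lists instead of tensors so it runs without dependencies. `ToyModel` and `ToyTrainer` are hypothetical stand-ins, not axolotl classes; only the `include_tkps and model.training` condition and the `!= -100` mask come from the PR.

```python
from dataclasses import dataclass

IGNORE_INDEX = -100  # label value masked out of the loss, as in the PR's `!= -100` check


@dataclass
class ToyModel:
    """Stand-in for a torch module; only the `training` flag matters here."""

    training: bool = True


@dataclass
class ToyTrainer:
    """Hypothetical trainer that mirrors the guard, not the real base trainer."""

    include_tkps: bool = True
    num_tokens: int = 0

    def compute_loss(self, model, inputs):
        # Count tokens only when the flag is set AND the model is in train mode.
        if self.include_tkps and model.training:
            key = "labels" if "labels" in inputs else "input_ids"
            self.num_tokens += sum(tok != IGNORE_INDEX for tok in inputs[key])
        return 0.0  # placeholder; the real method returns the model loss


trainer = ToyTrainer()
batch = {"labels": [-100, -100, 5, 7, 9]}
trainer.compute_loss(ToyModel(training=True), batch)   # counted
trainer.compute_loss(ToyModel(training=False), batch)  # skipped, as during eval
print(trainer.num_tokens)  # 3
```

With the guard in place, an evaluation pass over the same batch leaves the counter unchanged.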

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Token-per-second tracking condition `src/axolotl/core/trainers/base.py` | Modified `compute_loss` to enable tokens-per-second counting only during training mode (`model.training is True`), adding a conditional guard around previously unconditional token tracking logic |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–20 minutes

Extra attention areas:
- Verify the impact on metrics reporting during evaluation and inference phases
- Confirm that suppressing token-per-second tracking during non-training modes doesn't break logging or monitoring pipelines
- Check for any side effects on the loss computation flow when model.training is False

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
| Title check | ❓ Inconclusive | The title partially addresses the changeset: it mentions computing loss only if training, which aligns with the main change (enabling tokens-per-second tracking only in training mode), but also mentions 'update token metric naming' which is not reflected in the provided summary of changes. | Clarify whether 'update token metric naming' is part of this PR or should be removed from the title. If it is included, provide details on which metrics were renamed. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/axolotl/core/trainers/base.py (1)
351-369: Gating tkps tracking on model.training aligns with intent; consider edge‑cases and minor cleanup

Conditioning the token accounting on model.training achieves the goal of only tracking tokens during actual training steps and avoids polluting metrics during eval/inference. This should be compatible with the standard Trainer.train() / Trainer.evaluate() flow where the trainer toggles model.train() / model.eval().

Two minor points to keep in mind:

If you have any custom code paths that call compute_loss for “training-like” work while leaving model.training == False (e.g., manual scoring runs), tkps will now be skipped there; worth confirming that you don’t rely on tkps in those flows.

Optional micro‑refactor: you can avoid recomputing the mask sum when updating self.state.num_tokens by reusing num_tokens (or its .cpu() copy), e.g.:
-        if self.args.include_tkps and model.training:
-            inputs_key = "labels" if "labels" in inputs else "input_ids"
-            num_tokens = (inputs[inputs_key] != -100).sum()
+        if self.args.include_tkps and model.training:
+            inputs_key = "labels" if "labels" in inputs else "input_ids"
+            num_tokens = (inputs[inputs_key] != -100).sum()
+            num_tokens_cpu = num_tokens.cpu()
@@
-            if hasattr(self.state, "num_tokens"):
-                self.state.num_tokens = (
-                    self.state.num_tokens + (inputs[inputs_key] != -100).sum().cpu()
-                )
-            else:
-                self.state.num_tokens = (inputs[inputs_key] != -100).sum().cpu()
+            if hasattr(self.state, "num_tokens"):
+                self.state.num_tokens = self.state.num_tokens + num_tokens_cpu
+            else:
+                self.state.num_tokens = num_tokens_cpu

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c6ddcdd and 2a558f6.

📒 Files selected for processing (1)

src/axolotl/core/trainers/base.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
GitHub Check: PyTest from Source Dist (3.11, 2.9.0)
GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
GitHub Check: PyTest (3.11, 2.9.0)
GitHub Check: PyTest (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.8.0)

codecov · 2025-12-02T05:43:42Z

Codecov Report

❌ Patch coverage is 21.27660% with 37 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/utils/callbacks/tokens_per_second.py | 27.58% | 21 Missing ⚠️ |
| src/axolotl/core/trainers/base.py | 11.11% | 16 Missing ⚠️ |

xzuyn · 2025-12-02T19:20:22Z

Solves the eval increase part, but the reset on resume still happens.

Eval every 10 steps.

ved1beta · 2025-12-03T08:02:16Z

Ahh, I thought it was just the recomputing. Here you go.

NanoCode012 · 2025-12-04T15:01:24Z

+                tokens_state = json.load(f)
+            state.total_tokens = torch.tensor(tokens_state.get("total_tokens", 0))
+            state.num_tokens = torch.tensor(tokens_state.get("num_tokens", 0))
+            LOG.info(f"Restored total_tokens: {state.total_tokens}")


Is this how we should store total_tokens? Are we able to inject it into the trainer state file that gets created in checkpoints?

ved1beta · 2025-12-04

Ummm, TrainerState is a dataclass; it only serializes defined fields.
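To illustrate the point about dataclass serialization, here is a small sketch. It assumes the state is serialized with `dataclasses.asdict`, uses a hypothetical `ToyState` in place of `TrainerState`, and writes to a hypothetical sidecar file name `tokens_state.json`; the actual file name and layout used by the PR are not shown in this thread.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyState:
    """Stand-in for a dataclass-based trainer state with one declared field."""

    global_step: int = 0


state = ToyState(global_step=10)
state.total_tokens = 1234  # attached at runtime, not a declared field

# asdict() walks declared fields only, so the extra attribute is dropped.
serialized = asdict(state)
print(serialized)  # {'global_step': 10}

# A sidecar JSON file (hypothetical name) can carry the counter instead.
with tempfile.TemporaryDirectory() as ckpt_dir:
    sidecar = Path(ckpt_dir) / "tokens_state.json"
    sidecar.write_text(json.dumps({"total_tokens": state.total_tokens}))

    # On resume: rebuild the state, then restore the counter from the sidecar.
    restored = ToyState(**serialized)
    tokens_state = json.loads(sidecar.read_text())
    restored.total_tokens = tokens_state.get("total_tokens", 0)

print(restored.total_tokens)  # 1234
```

The `.get(..., 0)` default mirrors the snippet under review, so a checkpoint written before the counter existed would resume from zero rather than fail.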

xzuyn · 2025-12-06T19:20:09Z

Resuming works too. [2025-12-06 13:23:07,478] [INFO] [axolotl.utils.callbacks.tokens_per_second] Restored total_tokens: 544092

NanoCode012

As mentioned in chat, let's refactor to use tokens/total and tokens/trainable. I believe self.num_tokens is redundant?
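A rough sketch of the two counters proposed here, kept in a single dict. The definitions are assumptions for illustration: "total" counts every position in `input_ids` (including any padding), and "trainable" counts label positions that are not `-100`. `count_tokens` is a hypothetical helper, and lists stand in for tensors.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def count_tokens(batch):
    """Return (total, trainable) token counts for one batch of token-id rows."""
    total = sum(len(row) for row in batch["input_ids"])
    trainable = sum(
        tok != IGNORE_INDEX for row in batch["labels"] for tok in row
    )
    return total, trainable


# One dict replaces separate total_tokens / num_tokens attributes.
tokens = {"total": 0, "trainable": 0}

batch = {
    "input_ids": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "labels": [[-100, -100, 3, 4], [-100, 6, 7, 8]],
}
total, trainable = count_tokens(batch)
tokens["total"] += total
tokens["trainable"] += trainable
print(tokens)  # {'total': 8, 'trainable': 5}
```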

ved1beta · 2025-12-09T12:56:37Z

Train and eval are working fine, and so are checkpoints.

NanoCode012 · 2025-12-16T05:18:49Z

+                self.state.tokens["total"] + torch.as_tensor(total_tokens).cpu()
+            )
+            # Store per-step trainable tokens for throughput calculation
+            self.state.tokens["trainable_step"] = trainable_tokens.detach().cpu()


This can be removed, as it is unused (not logged and too similar to the others).

NanoCode012 · 2025-12-16T05:37:36Z

                self.state.last_tokens_per_second.item() / self.args.logging_steps, 2
            )
-            logs["total_tokens"] = int(self.state.total_tokens.item())
+            logs["tokens/total"] = int(self.state.tokens["total"].item())


I think we missed logging tokens/trainable.

NanoCode012 · 2025-12-16T08:03:39Z

+        if tokens and "total" in tokens:
+            logs["tokens/total"] = tokens["total"].item()
+
+        if tokens and "trainable" in tokens:
+            logs["tokens/trainable"] = tokens["trainable"].item()


Is this a duplicate of the logging at base.py L651-652?

ved1beta · 2025-12-16

Yes. Redundant?

winglian · 2025-12-23T15:25:09Z

Can we add a CI test, similar to TestResumeLlama, that checks the total token count is correct on resume?

ved1beta · 2025-12-23T15:41:00Z

Okay!

* upgrade dependencies
* don't use reset sessions
* downgrade transformers, upgrade other deps
* upgrade bnb to 0.49.0
* restore s3 cache
* explicit use local files w hub
* decompress and strip top level dir
* use 2 levels for strip components
* try to preserve permissions for symlinks
* use updated tar
* fix #3293 for distributed
* downgrade bnb
* fast fail after 4
* fix total tokens device
* patch accelerate CP/SP (#3309)

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>

xzuyn · 2026-01-05T19:44:46Z

tokens/total and tokens/trainable aren't being sent to wandb. They are also being shown as floats instead of ints.
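One possible way to keep the logged counters integral is to cast them just before they enter the logs dict. This is only an illustrative sketch: `build_token_logs` is a hypothetical helper, and it is not necessarily how the follow-up fix (#3344) addressed the report, nor does it cover the wandb delivery issue.

```python
def build_token_logs(tokens):
    """Build log entries for the token counters, forcing integer values."""
    logs = {}
    for name in ("total", "trainable"):
        if name in tokens:
            value = tokens[name]
            # Tensors and NumPy scalars expose .item(); plain numbers do not.
            if hasattr(value, "item"):
                value = value.item()
            logs[f"tokens/{name}"] = int(value)
    return logs


print(build_token_logs({"total": 544092.0, "trainable": 1200.0}))
# {'tokens/total': 544092, 'tokens/trainable': 1200}
```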

compute loss only if training

2a558f6

coderabbitai Bot reviewed Dec 2, 2025

Ved added 2 commits December 3, 2025 13:20

save total_tokens for checkpiont

1a161a5

check if string

b62171b

NanoCode012 reviewed Dec 4, 2025

Merge branch 'main' into training_loss

2e2c365

NanoCode012 requested changes Dec 9, 2025

Ved and others added 2 commits December 9, 2025 18:24

refactor total_tokens/ num_tokens

8f9c8dd

Merge branch 'main' into training_loss

044e700

refactor 2

bd8a9ce

ved1beta requested a review from NanoCode012 December 11, 2025 11:49

NanoCode012 reviewed Dec 16, 2025

Ved and others added 3 commits December 16, 2025 11:44

rplc trainable_step/trian_per_sec_per_gpu

5ea1344

lint + log trainable/tokens

13761fc

Merge branch 'main' into training_loss

108db3b

NanoCode012 reviewed Dec 16, 2025

Ved and others added 2 commits December 19, 2025 15:03

consolidate it in the callback.

79f247d

Merge branch 'main' into training_loss

8ff36b7

ved1beta requested a review from NanoCode012 December 19, 2025 09:38

Merge branch 'main' into training_loss

373c841

NanoCode012 approved these changes Dec 22, 2025

NanoCode012 added the ready to merge label Dec 23, 2025

NanoCode012 changed the title ~~compute loss only if training~~ compute loss only if training and update token metric naming Dec 23, 2025

test for total_tokes aftr remuse

a7d6b7f

NanoCode012 added the hold don't merge this yet label Dec 25, 2025

check if tokenstate exist after ckpt

8f3f6c4

NanoCode012 merged commit a6080df into axolotl-ai-cloud:main Dec 25, 2025
9 checks passed

winglian added a commit that referenced this pull request Dec 29, 2025

fix #3293 for distributed

ad8c1bf

ved1beta mentioned this pull request Jan 6, 2026

fix total/trainable tokens log #3344

Merged

coderabbitai Bot mentioned this pull request Jan 9, 2026

feat(dft): Add Dynamic Fine-Tuning (DFT) plugin with comprehensive compatibility support #3348

Closed

8 tasks

winglian removed the ready to merge label Mar 22, 2026
